Before starting this exercise, you should have had a brief introduction to using RStudio. If not, take a look at the Using RStudio worksheet.
Loading a package
Packages are extensions to R that add new commands.
Nearly everything we’ll do in this course uses the tidyverse package, so pretty much every project starts with the instruction for R to load this package. You can find out more about packages here.
For now, simply type (or copy and paste) the command in the grey box into line 1 of the Script window of RStudio.
Then, with your cursor still on line 1, press CTRL+ENTER (i.e. press the key marked ‘Ctrl’ and the RETURN or ENTER key together).
library(tidyverse)
When you do this, line 1 is automatically copied to your Console window and run. RStudio will then print some text to the Console (shown in the white box, above). This text tells you that the tidyverse package has loaded (“attached”) some other packages (e.g. dplyr). It also tells you that the dplyr package changes the way some commands in R work (“conflicts”). That’s OK.
If you get an output that includes the word ‘error’, please see the guide to common problems on the DLE/teams.
Saving your script
You should notice that the name Untitled1 on the Script window has now gone red (or some other text, if you chose a different name for your file).
This is to remind you that your script has changed since the last time you saved it. So, click on the “Save” icon (the little floppy disk) and save your R script with some kind of meaningful name, for example briefguide.R.
The .R indicates that it is an R script.
Re-save your script each time you change something in it; that way, you won’t lose any of your work.
Loading data
Now, we’re going to load some data on the income of 10,000 people in the United States of America. I’ve made up this dataset for teaching purposes, but it’s somewhat similar to large open data sets available on the web, such as US Current Population Survey. Here’s how you get a copy of this data into RStudio so you can start looking at it.
- Copy or type the following command into your RStudio script window, and run it (i.e. press CTRL+ENTER while your cursor is on that line)
cpsdata <- read_csv("http://www.willslab.org.uk/cps2.csv")
Explanation of the command
There are three parts to the command cpsdata <- read_csv("http://www.willslab.org.uk/cps2.csv"):
- The first part of the command is `cpsdata`. This gives a name to the data we are going to load. We’ll use this name to refer to it later, so it’s worth using a name that is both short and meaningful. I’ve called it `cpsdata` because it’s somewhat similar to data from the US Current Population Survey, but you can give data pretty much any name you choose (e.g. `fart`).
- The bit in the middle, `<-`, is an arrow and is typed by pressing `<` and then `-`, without a space. This arrow means “put the thing on the right of the arrow into the thing on the left of the arrow”.
- The last part of the command is `read_csv("http://www.willslab.org.uk/cps2.csv")`. It loads the data file from the URL given into `cpsdata`. The part inside the speech marks, `http://www.willslab.org.uk/cps2.csv`, is the URL of the data on the internet. You can copy and paste this URL into a web browser to see for yourself that the data are there.
This is a very convenient way of loading data into R. Sometimes though, your data won’t be available on the internet. In this case you can also upload data to the RStudio server, and we will show you how to do that in a future session.
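When your data live in a local file rather than at a URL, the command looks the same but takes a file path. Here is a minimal sketch: it creates a small example CSV in a temporary location and reads it back with base R’s `read.csv` (the base-R counterpart of `read_csv`); the column names and values are made up for illustration.

```r
# Create a tiny example CSV in a temporary file.
# With a real file, you would just supply its path instead of `tmp`.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,income", "1,52000", "2,61000"), tmp)

# read.csv is base R's equivalent of read_csv; read_csv(tmp) would work
# the same way once tidyverse is loaded.
localdata <- read.csv(tmp)
nrow(localdata)   # 2 rows, one per person in our made-up file
```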
Explanation of the output
R likes to print things in red sometimes – this does not always mean there’s a problem. If there’s a problem, it will actually say ‘error’. The output here tells us that R has loaded the data, which has eight parts (columns, or cols). It gives us the name of the columns (ID, sex, ...) and tells us what sort of data each column contains: character means the data is words (e.g. ‘female’), double means the data is a number (e.g. ‘42.78’) (more about the different types of variables).
If you get an error here, please see the common problems worksheet on the DLE.
Inspecting data
Load the CPS2 data for yourself from the URL provided above.
Save this as a variable called `cpsdata`.
Inspect the data using the Environment pane
If you have opened the data in the Environment pane you will see that this data frame has 8 columns and 10000 rows.
Each row is one person, and each column provides some information about them. Below is a description of each of the columns. Where you see NA this means this piece of data is missing for this person – quite common in some real datasets.
Here’s what each of the columns in the data set contains:
| Column | Description | Values |
|---|---|---|
| ID | Unique anonymous participant number | 1-10,000 |
| sex | Biological sex of participant | male, female |
| native | Participant born in the US? | foreign, native |
| blind | Participant blind? | yes, no |
| hours | Number of hours worked per week | a number |
| job | Type of job held by participant: | charity, nopay, private, public |
| income | Annual income in dollars | a number |
| education | Highest qualification obtained | grade-school, high-school, bachelor, master, doctor |
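As noted above, some cells in real datasets are `NA` (missing). A quick sketch of how `NA` behaves in calculations, using made-up numbers rather than the CPS2 data:

```r
# Made-up example: three incomes, one of them missing
incomes <- c(50000, 70000, NA)

# By default, any calculation that touches an NA returns NA
mean(incomes)                 # NA

# na.rm = TRUE tells R to drop the missing values before calculating
mean(incomes, na.rm = TRUE)   # 60000
```

This matters later: if a column you summarise contains `NA`, commands like `mean` will return `NA` unless you ask R to remove the missing values.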
Calculating a mean
One question we can ask of these data is “what is the average income of people in the U.S.?” (or, at least, in this sample).
In this first example, we’re going to calculate the mean income.
As you know, you calculate a mean by adding up all the incomes and dividing by the number of incomes. Our sample has 10,000 participants, so this would be a long and tedious calculation – and we’d probably make an error.
It would also be a little bit tedious and error prone in a spreadsheet application (e.g. Excel, LibreOffice Calc). There are some very famous cases of these kinds of “Excel errors” in research, e.g. genetics, economics.
In R, we can calculate the mean instantly, and it’s harder to make the sorts of errors that are common in Excel-based analysis.
To calculate mean income in R, we add the following command to our script and press CTRL+ENTER:
cpsdata %>%
summarise(mean(income))
> # A tibble: 1 x 1
> `mean(income)`
> <dbl>
> 1 87293.
Your output will tell you the mean income in this sample – it’s the last number on the bottom right, and it’s approximately $87,000.
If you get an error here, please see the common errors worksheet on the DLE.
Explanation of the command
This command has three components:
- The bit on the left, `cpsdata`, is our data frame, which we loaded and named earlier.
- The bit in the middle, `%>%`, is called a pipe. Its job is to send data from one part of your command to another. It is typed by pressing `%` then `>` then `%`, without spaces. So `cpsdata %>%` sends our data frame to the next part of our command.
- The bit on the right, `summarise(mean(income))`, is itself made up of parts. The command `summarise` does as the name might suggest: it summarises a set of data (`cpsdata` in this case) into a single number, e.g. a mean. The `mean` command indicates that the type of summary we want is a mean (there are other summaries, as we will cover later). Finally, `income` is the name of the column of `cpsdata` we want to take the mean of – in this case, the income of each individual.
Make sure you are 100% clear about the difference between <- and %>%. If you’re not, ask for an explanation in class now.
The main clue is to look at the direction of the arrows:
%>% sends data from left to right. We call this ‘piping’.
<- sends results from the right hand side, to a variable named on the left. This is called assignment.
Watch out that -> is not the same as %>%. The thin arrow is always for assignment. You won’t see it often, because it’s normally considered bad manners to use thin right arrows like this (they get confusing).
It really is worth learning the keyboard shortcuts for <- and %>% — you will be typing them a lot during the course. See the “Being Efficient” worksheet.
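A tiny sketch of the difference, using the built-in `mtcars` data (for simple commands like these, R’s native pipe `|>` behaves the same as tidyverse’s `%>%`):

```r
# Assignment: <- sends the result on the right into the name on the left
n_cars <- nrow(mtcars)
n_cars             # 32

# Piping: a pipe sends data from left to right into the next command
# (shown here with R's native |> pipe, which works like %>% in simple cases)
mtcars |> nrow()   # also 32

# -> also assigns, right to left, but is best avoided (it gets confusing)
nrow(mtcars) -> n_cars2
```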
Grouping data
One of the most widely discussed issues concerning income is the difference between what men and women, on average, get paid. Let’s have a look at that difference in our teaching sample of 10,000 US participants.
In order to do this, we need to split our data into two groups – males and females. In R, the command group_by allows us to do this. In this case, we want to group the data by biological sex, so the command is group_by(sex). We pipe (%>%) the data in cpsdata to the group_by command in order to group it, and then we pipe (%>%) it to summarise to get a summary for each group (a mean, in this case). So, the full command is:
cpsdata %>% group_by(sex) %>% summarise(mean(income))
> # A tibble: 2 x 2
> sex `mean(income)`
> <chr> <dbl>
> 1 female 82677.
> 2 male 92137.
Copy it into your script and run it (CTRL+ENTER). Women in our made-up sample get paid, on average, around $9,000 (9k) less than men. Of course, not every male gets 92k a year in the US, and not every female gets 83k. It seems very likely that the ranges of incomes earned by men and women overlap – meaning that if you picked one man and one woman at random, there’s a reasonable chance that the woman earns more than the man. We can look at this variation in pay using a graph.
Looking at variation using a density plot
The graph we’re going to draw is a density plot. If you recall histograms from school, it’s a lot like that. If not, don’t worry.
A density plot is a curve showing how common different incomes are: the higher the curve is at a particular income, the more people have that income.
We’re going to produce what’s called a scaled density plot. The highest point on a scaled density plot is always one. This can make it easier to compare two groups, particularly if one group has fewer people in it than the other.
So here’s the command to do a scaled density plot for incomes, plotting men and women separately. Copy it into your script and run it (CTRL+ENTER).
cpsdata %>%
ggplot(aes(income, colour=sex)) +
geom_density(aes(y=..scaled..))
Explanation of command
Here’s what each part of this command means:
- `cpsdata` – The data frame containing the data. You created this in the last worksheet.
- `%>%` – A pipe. As in the last worksheet, this pipe carries the data in `cpsdata` to the next part of the command, which does something with it.
- `ggplot()` – This means ‘draw me a graph’. All graphs we use in these worksheets use the Grammar of Graphics (gg) plotting commands, so they’ll all include the command `ggplot`.
- `aes()` – Short for aesthetics (what things look like). It means ‘this is the sort of graph I want’.
- `income` – I want a graph of the data in the `income` column of `cpsdata`.
- `colour=sex` – I want two graphs on top of each other, in different colours: one colour for men, a different colour for women. Use the `sex` column of `cpsdata` to work out who is male and who is female.
- `geom_density()` – I want this graph to be a density plot.
- `aes(y=..scaled..)` – I want this density plot to be scaled (see above).
Discussion of output
Your graph will appear in the bottom-right window, and should look like the one above. You’ll notice that the two lines seem basically on top of each other … but they can’t be because we know the two groups differ in mean income by over nine thousand dollars! We have a problem to solve…
Dealing with extreme data points
The problem is one of scale – there are a small number of people who earn very high salaries. In fact, both the highest-paid man, and the highest-paid woman in our sample earn considerably more than one million dollars a year.
Filtering data
Somehow, we need to deal with the fact that a few people in our sample are very well paid, which makes the difference between men and women hard to see on our graph, despite the difference being over nine thousand dollars a year.
One of the easiest ways around this is to exclude these very high salaries from our graph.
The vast majority of people are paid less than 150k a year. So, let’s restrict our plotting to just those people. We do this using the filter command. It’s called filter because it works a bit like the filter paper in a chemistry lab (or in your coffee machine) – stopping some things, while letting other things pass through. We can filter our data by telling R what data we want to keep. Here, we want to keep all people who earn less than $150k, and filter out the rest. So the filter we need is filter(income < 150000), where < means “less than”.
We’ll be using this dataset of people with <$150k incomes a few times, so we’re going to give it a new name, cpslow (or any other name you want, e.g. angelface )
So, what we need to do is pipe (%>%) our cpsdata data to our filter(income < 150000), and use an arrow, <-, to send this data to our new data frame, cpslow. Recall that <- sends the thing on its right to the thing on its left, so the full command is:
cpslow <- cpsdata %>% filter(income < 150000)
We can take a look at this new data frame by clicking on it in RStudio’s Environment window (see video here if you’re not sure how). By looking at the ID numbers, you can see that some people in our original sample have been taken out, because they earned at least $150k.
Now, we can plot these filtered data in the same way as before, by changing the name of the dataframe from cpsdata to cpslow.
So start with the command cpsdata %>% ggplot(aes(income, colour=sex)) + geom_density(aes(y=..scaled..)), copy it onto the next line in your script, make that change, and press CTRL+RETURN.
If you’ve got it right, your graph will look like this:
At first glance, the two distributions of incomes still look similar. For example, the most common (modal) income is quite low, and it is similar for both men and women. On closer inspection, though, you’ll see that the red line (women) is above the blue line (men) up to around 25-50k, and below it from then on. This means that more women than men earn less than 50k, and more men than women earn more than 50k.
So, the gender pay gap is visible in this graph. The graph also illustrates that the difference in this sample is small, relative to the range of incomes. This doesn’t mean that the gender pay gap is less (or more) important than income inequality. These kinds of questions of importance are moral, philosophical, and political. Data cannot directly answer these kinds of questions, but they can provide information to inform the debate.
As we’ll see later, this type of graph is also crucial in informing our choice of statistical models (like regression or ANOVA): without a clear sense of what the data look like, we can make bad decisions in our analyses.
Consolidation exercise
This exercise consolidates what you’ve learned so far.
The task is to further examine the sub-sample of participants who are living in the US, and earning less than $150k (cpslow).
Specifically, the question to answer is whether people born in the US earn more. In order to do this, you should calculate the mean income for each group, and produce a density plot with one line for each group. Below are the answers you are aiming for:
> # A tibble: 2 x 2
> native `mean(income)`
> <chr> <dbl>
> 1 foreign 51423.
> 2 native 58905.
Previously, we calculated the mean salary of men and women.
- Why might it be a better idea to calculate the median?
Because the data are strongly skewed, the median may be a better summary of the central tendency (the middle).
- Adapt the commands above to calculate the median instead. What is the median salary for women: , and for men: (note this is for everyone in cpsdata, not just those earning under $150k).
cpsdata %>%
group_by(sex) %>%
summarise(med=median(income))
> # A tibble: 2 x 2
> sex med
> <chr> <dbl>
> 1 female 52558.
> 2 male 61746.
Undergraduate stats in R
All of the statistics you will have learned at undergraduate level can be produced in R. Here we cover simple examples of:
- A t-test
- A correlation
If you have no memory of t-tests or correlations, you might want to work through the expanded guides from our undergraduate course at a later date.
We’ll run these statistics on an example dataset which is built into R, called mtcars. We can look at this data using the glimpse function (this is loaded with tidyverse, so if you get an error make sure that is loaded too):
mtcars %>% glimpse()
> Rows: 32
> Columns: 11
> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…
Explanation of the glimpse output
glimpse produces a list of all variables in the dataset, tells us what type each one is, and lists as many observations from the dataset as will fit on a single line.
The type of all variables in mtcars is dbl. This is short for ‘double-precision number’; for now, just know that dbl means a number.
Other types include:

- `int` – short for ‘integer’: contains only whole numbers (e.g. a participant ID number)
- `chr` – short for ‘character’: contains text (e.g. an email address)
- `fct` – short for ‘factor’: a categorical variable (e.g. MCQ responses)
- `ord` – short for ‘ordered’: a variant of categorical variable where the categories have a particular order (responses like “Worst” < “Better” < “Best” could be stored as an `ord`)
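A small sketch of how these types arise when you create variables yourself (the example values are made up):

```r
# chr: plain text
emails <- c("a@example.com", "b@example.com")

# fct: a categorical variable
responses <- factor(c("yes", "no", "yes"))

# ord: an ordered categorical variable -- the levels have a defined order
ratings <- factor(c("Worst", "Best", "Better"),
                  levels = c("Worst", "Better", "Best"),
                  ordered = TRUE)

is.ordered(ratings)       # TRUE
ratings[2] > ratings[3]   # TRUE, because "Best" comes after "Better"
```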
Two sample t-test
mtcars contains a variable called mpg, which is the miles per gallon each car will do, and another called am, which encodes whether the car has an automatic or manual transmission (0=automatic, 1=manual).
We can test if mpg differs between auto and manual cars with t.test:
t.test(mpg ~ am, data=mtcars)
>
> Welch Two Sample t-test
>
> data: mpg by am
> t = -3.7671, df = 18.332, p-value = 0.001374
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
> -11.280194 -3.209684
> sample estimates:
> mean in group 0 mean in group 1
> 17.14737 24.39231
Explanation
The command contains three parts:
- `t.test`: says what we want to do
- `mpg ~ am`: this is a ‘formula’, which tells `t.test` which variables to analyse
- `data=mtcars`: which dataset we want to use for the analysis
The formula is split into two parts by the ~ symbol. On the left is our outcome. On the right is the grouping variable, which we hope predicts the outcome.
In the output you can see the test statistic, degrees of freedom and p value.
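The result of `t.test` is an object you can store and query, which is handy when you want to report individual values rather than the whole printout. A brief sketch:

```r
# Store the test result in a variable, then pull out individual values
result <- t.test(mpg ~ am, data = mtcars)

result$statistic   # the t value
result$p.value     # the p value
result$estimate    # the mean of each group
```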
The tilde symbol. Pronounced “tilder”.
In R, ~ almost always means “is predicted by”.
Correlations
The mtcars data also contains variables for weight (wt) and power (hp, short for horsepower).
We can select just these columns and save them to a smaller dataframe like this:
carperformance <- mtcars %>% select(mpg, wt, hp)
Explanation of the commands
On the far left we have the name of the new variable which we will create: carperformance.
We can tell this will be a new variable because the <- symbol is just to the right, pointing at it.
To work out what carperformance will contain, we look to the right of the <-. There are two parts here, linked by the pipe symbol (%>%), which passes data from one command to the next, from left to right.
First we see the mtcars data. Using a pipe we pass this to the select command, which selects the `mpg`, `wt`, and `hp` columns.
Explanation of the result
When running the command you won’t see any output — but something has happened behind the scenes: A new object was created called carperformance which contained copies of the columns from mtcars we selected.
We can see the first few rows of our new smaller dataframe like this:
carperformance %>% head()
> mpg wt hp
> Mazda RX4 21.0 2.620 110
> Mazda RX4 Wag 21.0 2.875 110
> Datsun 710 22.8 2.320 93
> Hornet 4 Drive 21.4 3.215 110
> Hornet Sportabout 18.7 3.440 175
> Valiant 18.1 3.460 105
To correlate the three columns in this dataset, we can use the cor function and round all the results to 2 decimal places:
carperformance %>% cor() %>% round(2)
> mpg wt hp
> mpg 1.00 -0.87 -0.78
> wt -0.87 1.00 0.66
> hp -0.78 0.66 1.00
On the left we have the carperformance data.
We pipe this to the cor function which calculates the correlation between each pair of columns and returns a special kind of table, called a matrix.
To make the output simpler, we then pass the results to the round function, which rounds all the results to 2 decimal places.
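`cor` gives correlation coefficients but no significance test. For a single pair of variables, base R’s `cor.test` also reports a p value and confidence interval; here is a brief sketch (the `carperformance` data frame is recreated with base R so the example is self-contained):

```r
# Recreate the smaller data frame using base R subsetting
carperformance <- mtcars[, c("mpg", "wt", "hp")]

# Test the correlation between weight and horsepower
ct <- cor.test(carperformance$wt, carperformance$hp)

round(ct$estimate, 2)   # 0.66, matching the matrix above
ct$p.value              # very small, so unlikely to be chance
```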
The cor function is pretty bare-bones, and doesn’t produce output we could easily use in a report or article. The apaTables package helps us with this:
apaTables::apa.cor.table(carperformance, filename = "correlations.doc")
>
>
> Means, standard deviations, and correlations with confidence intervals
>
>
> Variable M SD 1 2
> 1. mpg 20.09 6.03
>
> 2. wt 3.22 0.98 -.87**
> [-.93, -.74]
>
> 3. hp 146.69 68.56 -.78** .66**
> [-.89, -.59] [.40, .82]
>
>
> Note. M and SD are used to represent mean and standard deviation, respectively.
> Values in square brackets indicate the 95% confidence interval.
> The confidence interval is a plausible range of population correlations
> that could have caused the sample correlation (Cumming, 2014).
> * indicates p < .05. ** indicates p < .01.
Sometimes we load a whole package, as we did when we wrote library(tidyverse) above. This is a good idea when we want to use lots of functions from that package.
When we only want to use one function from a package we can type nameofpackage::nameoffunction and this lets us use the function without loading the package.
This can be a good idea if the package or function is less well known, and you want to be explicit about which package it comes from—it helps ‘future-you’ work out what your code is doing.
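For example, using the built-in stats package (always available, so nothing extra needs installing):

```r
# packagename::functionname calls one function without library()
stats::sd(c(2, 4, 6))   # 2
```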
Explanation of the result
We used the apa.cor.table function within the apaTables package to create a nicely-formatted correlation table, in APA format.
We also specified a filename, and apa.cor.table created a Word document with this name containing the formatted table (click to see the result).
Use one of the other built-in datasets in R to run a correlation between 2 variables.
Use the built-in `sleep` data. Compute a t-test comparing the `extra` variable between groups. Describe the results of the t-test in APA format.
Exploring the built in datasets
Some examples of other built-in data are:
- `sleep`
- `iris`
- `airquality`
- `ChickWeight`
- `diamonds`
If you have loaded tidyverse you can access these by name, just like the mtcars data. For example:
diamonds %>% glimpse()
> Rows: 53,940
> Columns: 10
> $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0…
> $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ve…
> $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I…
> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1,…
> $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6…
> $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5…
> $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3…
> $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4…
> $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4…
> $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2…
You can find out more about each of them by typing:
help(iris)
Explore some of the built in datasets in R.
Apply the commands for summarising, grouping and plotting these data that we have learned today to describe the data within them.
Some of this material was adapted from Andy Wills’ RminR.